Lab Assignment One: Exploring Table Data
Richmond Aisabor
Cars are an essential part of daily life because they provide a safe and affordable means of transportation. Depending on where you live, there may be no alternative to car ownership. Even though cars provide transportation, they are usually viewed as a liability, so the decision to buy a car is important and has implications for one's standard of living.
In this study, the Car Evaluation Dataset will be used to build a model that can classify cars according to their quality. The dataset has 1728 observations and 6 features, including categorical and numerical features. The dataset was derived from a hierarchical decision model, and each feature can be placed into one of three concepts: Price, Tech, and Comfort. These concepts form the decision-making system. The dataset fulfills the lab requirements and is free to download at the UC Irvine Machine Learning Repository.
The user who would benefit the most from a successful classification is a first-time car buyer. First-time buyers usually have minimal experience with cars and must decide on a purchase with incomplete information. The information they do receive comes from a sales representative, who has an incentive to sell the car regardless of how well it actually performs. If a device can determine which car is best, the user can purchase a car with confidence. To be confident that the algorithm is learning properly, it must have a success rate better than 50%, i.e., better than random chance. The goal is for the algorithm to be as close to 100% accuracy as possible.
# load the car-evaluation dataset
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
import copy
df = pd.read_csv("cardata.csv") # read in the csv file
df_original = copy.deepcopy(df) #create copy of original data frame
# rename the columns to more descriptive names
df = df.rename(columns = {'maint': 'maintenance_price', 'buying': 'buying_price', 'doors': 'number_doors', 'lug_boot': 'boot_space'})
# replace buying price with a numerical indicator
df.buying_price.replace(to_replace = ['low', 'med', 'high', 'vhigh'],
value = range(1,5), inplace = True)
# replace maintenance price with a numerical indicator
df.maintenance_price.replace(to_replace = ['low', 'med', 'high', 'vhigh'],
value = range(1,5), inplace = True)
# replace boot space with a numerical indicator
df.boot_space.replace(to_replace = ['small', 'med', 'big'],
value = range(1,4), inplace = True)
# replace safety with a numerical indicator
df.safety.replace(to_replace = ['low', 'med', 'high'],
value = range(1,4), inplace = True)
# replace quality with a numerical indicator
df.quality.replace(to_replace = ['unacc', 'acc', 'good', 'vgood'],
value = range(1,5), inplace = True)
# store names of features
feature_names= list(df.columns.values)
df.head()
| | buying_price | maintenance_price | number_doors | persons | boot_space | safety | quality |
|---|---|---|---|---|---|---|---|
| 0 | 4 | 4 | 2 | 2 | 1 | 1 | 1 |
| 1 | 4 | 4 | 2 | 2 | 1 | 2 | 1 |
| 2 | 4 | 4 | 2 | 2 | 1 | 3 | 1 |
| 3 | 4 | 4 | 2 | 2 | 2 | 1 | 1 |
| 4 | 4 | 4 | 2 | 2 | 2 | 2 | 1 |
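As a side note, the chained replace calls above can also be written as one explicit mapping per column, which is easier to audit against the category orderings. A minimal sketch on a toy frame (not part of the assignment pipeline; the toy data is made up):

```python
import pandas as pd

# ordered category -> integer code mappings (mirroring the replacements above)
price_map = {'low': 1, 'med': 2, 'high': 3, 'vhigh': 4}
size_map = {'small': 1, 'med': 2, 'big': 3}

toy = pd.DataFrame({'buying_price': ['low', 'vhigh', 'med'],
                    'boot_space': ['small', 'big', 'med']})
toy['buying_price'] = toy['buying_price'].map(price_map)
toy['boot_space'] = toy['boot_space'].map(size_map)
print(toy['buying_price'].tolist())  # [1, 4, 2]
print(toy['boot_space'].tolist())    # [1, 3, 2]
```

An advantage of `map` with a dict is that any unmapped category becomes NaN, which surfaces typos in the raw labels instead of silently leaving them unchanged.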
# Find the data types
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   buying_price       1728 non-null   int64
 1   maintenance_price  1728 non-null   int64
 2   number_doors       1728 non-null   object
 3   persons            1728 non-null   object
 4   boot_space         1728 non-null   int64
 5   safety             1728 non-null   int64
 6   quality            1728 non-null   int64
dtypes: int64(5), object(2)
memory usage: 94.6+ KB
None
# find the data summary
df.describe()
| | buying_price | maintenance_price | boot_space | safety | quality |
|---|---|---|---|---|---|
| count | 1728.000000 | 1728.000000 | 1728.000000 | 1728.000000 | 1728.000000 |
| mean | 2.500000 | 2.500000 | 2.000000 | 2.000000 | 1.414931 |
| std | 1.118358 | 1.118358 | 0.816733 | 0.816733 | 0.740700 |
| min | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| 25% | 1.750000 | 1.750000 | 1.000000 | 1.000000 | 1.000000 |
| 50% | 2.500000 | 2.500000 | 2.000000 | 2.000000 | 1.000000 |
| 75% | 3.250000 | 3.250000 | 3.000000 | 3.000000 | 2.000000 |
| max | 4.000000 | 4.000000 | 3.000000 | 3.000000 | 4.000000 |
Reviewing the info output shows there are no missing values in the dataset: every column holds 1728 non-null entries, matching the 1728 total observations.
If there were missing values, an imputation procedure could be used to fill in the missing entries: mean or median imputation for numerical values, and mode imputation for categorical values.
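As a sketch of what such imputation might look like (on a small made-up frame, since this dataset has no missing values to impute):

```python
import pandas as pd
import numpy as np

demo = pd.DataFrame({'price': [1.0, np.nan, 3.0, 4.0],
                     'color': ['red', 'blue', None, 'blue']})
# numerical column: fill with the column median (mean would also work)
demo['price'] = demo['price'].fillna(demo['price'].median())
# categorical column: fill with the column mode (most frequent value)
demo['color'] = demo['color'].fillna(demo['color'].mode()[0])
print(demo['price'].tolist())  # [1.0, 3.0, 3.0, 4.0]
print(demo['color'].tolist())  # ['red', 'blue', 'blue', 'blue']
```

Median is often preferred over mean for skewed numerical columns, since it is robust to outliers.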
The features to discard are number_doors and persons, because they are not useful for determining the quality of a car. How many people a car can hold and how many doors it has mostly indicate the size of the vehicle, and vehicle size is a matter of personal preference: a smaller vehicle is not necessarily less valuable than a larger one, or vice versa.
# remove columns that are not useful
df_reshape = copy.deepcopy(df)
df_reshape.drop(columns=['number_doors', 'persons'], inplace=True)
df_reshape.head()
| | buying_price | maintenance_price | boot_space | safety | quality |
|---|---|---|---|---|---|
| 0 | 4 | 4 | 1 | 1 | 1 |
| 1 | 4 | 4 | 1 | 2 | 1 |
| 2 | 4 | 4 | 1 | 3 | 1 |
| 3 | 4 | 4 | 2 | 1 | 1 |
| 4 | 4 | 4 | 2 | 2 | 1 |
# create a data description table
data_des = pd.DataFrame()
data_des['Features'] = df_reshape.columns
data_des['Description'] = ['price to purchase car', 'price to maintain car', 'size of the luggage boot', 'level of vehicle safety', 'level of car quality']
data_des['Scales'] = ['ordinal'] * 5
data_des['Discrete/Continuous'] = ['discrete'] * 5
data_des['Range'] = ['1: low; 2: medium; 3: high; 4: vhigh', '1: low; 2: medium; 3: high; 4: vhigh', '1: small; 2: medium; 3: big;', '1: low; 2: medium; 3: high;','1: unacceptable; 2: acceptable; 3: good; 4: vgood']
data_des
| | Features | Description | Scales | Discrete/Continuous | Range |
|---|---|---|---|---|---|
| 0 | buying_price | price to purchase car | ordinal | discrete | 1: low; 2: medium; 3: high; 4: vhigh |
| 1 | maintenance_price | price to maintain car | ordinal | discrete | 1: low; 2: medium; 3: high; 4: vhigh |
| 2 | boot_space | size of the luggage boot | ordinal | discrete | 1: small; 2: medium; 3: big |
| 3 | safety | level of vehicle safety | ordinal | discrete | 1: low; 2: medium; 3: high |
| 4 | quality | level of car quality | ordinal | discrete | 1: unacceptable; 2: acceptable; 3: good; 4: vgood |
The table above shows each feature's description, scale, and range.
# find the duplicate instances
idx = df_reshape.duplicated()
# count the duplicates (excluding each first occurrence)
len(df_reshape[idx])
1506
#df_reshape[idx]
The duplicate check shows that 1506 entries in the dataset are duplicates. With only 1728 total observations, such a high number of duplicated entries seems alarming at first. However, all the features in this dataset are categorical, so many duplicates should be expected. If the features contained continuous values, duplicates would be far less likely, since the range of possible values for each entry would be much larger.
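For reference, if the duplicates did need to be removed, pandas' drop_duplicates keeps only the first occurrence of each row. A minimal sketch on a toy frame (the duplicates are kept in this lab, since repeated categorical combinations are legitimate observations):

```python
import pandas as pd

demo = pd.DataFrame({'safety': [1, 1, 2, 2],
                     'quality': [1, 1, 1, 2]})
n_dupes = demo.duplicated().sum()   # rows identical to some earlier row
deduped = demo.drop_duplicates()    # keep the first occurrence of each row
print(n_dupes, len(deduped))        # 1 3
```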
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline
# Let's aggregate by quality and calculate the average maintenance price
df_grouped_qual = df.groupby(by='quality')
for val,grp in df_grouped_qual:
print('There were',len(grp),'cars in group',val)
print('---------------------------------------')
print('Note: 1: unacceptable; 2: acceptable; 3: good; 4: vgood')
print('---------------------------------------')
print('Average maintenance price:')
print(df_grouped_qual.maintenance_price.mean())
There were 1210 cars in group 1
There were 384 cars in group 2
There were 69 cars in group 3
There were 65 cars in group 4
---------------------------------------
Note: 1: unacceptable; 2: acceptable; 3: good; 4: vgood
---------------------------------------
Average maintenance price:
quality
1    2.633058
2    2.408854
3    1.333333
4    1.800000
Name: maintenance_price, dtype: float64
# plotting what we previously grouped
plt.style.use('ggplot')
quality = df_grouped_qual.maintenance_price.mean()
ax = quality.plot(kind='barh')
plt.title('Average Maintenance Price by Quality')
Text(0.5, 1.0, 'Average Maintenance Price by Quality')
The "Average Maintenance Price by Quality" plot shows that maintenance price generally decreases as quality increases. The lowest average maintenance price occurs at a quality of good, and the second lowest at very good. This could be because very good quality cars are newer vehicles with more features (luxury vehicles) and are therefore somewhat more expensive to maintain. Unacceptable and acceptable quality cars have the highest maintenance prices, with acceptable quality only slightly lower.
# Let's aggregate by quality and calculate the average safety
df_grouped_qual = df.groupby(by='quality')
for val,grp in df_grouped_qual:
print('There were',len(grp),'cars in group',val)
print('---------------------------------------')
print('Note: 1: unacceptable; 2: acceptable; 3: good; 4: vgood')
print('---------------------------------------')
print('Average safety')
print(df_grouped_qual.safety.mean())
There were 1210 cars in group 1
There were 384 cars in group 2
There were 69 cars in group 3
There were 65 cars in group 4
---------------------------------------
Note: 1: unacceptable; 2: acceptable; 3: good; 4: vgood
---------------------------------------
Average safety
quality
1    1.752893
2    2.531250
3    2.434783
4    3.000000
Name: safety, dtype: float64
# plotting what we previously grouped
plt.style.use('ggplot')
quality = df_grouped_qual.safety.mean()
ax = quality.plot(kind='barh')
plt.title('Average Safety by Quality')
Text(0.5, 1.0, 'Average Safety by Quality')
The "Average Safety by Quality" plot shows that as quality increases, the safety of the car also increases: there is a positive correlation between safety and quality. Driving is one of the most dangerous everyday activities, so safety is arguably the most important aspect of car ownership and a natural indicator of a car's overall quality.
# cross-tabulate safety and maintenance price against quality
df_cross = pd.crosstab([df['safety'],
df['maintenance_price']],
df.quality)
print(df_cross)
quality = df_cross.div(df_cross.sum(1).astype(float), axis=0)
# plot the proportion of each quality class per group
quality.plot(kind='barh', stacked=True)
plt.title('Average Safety and Maintenance Price by Quality')
quality 1 2 3 4
safety maintenance_price
1 1 144 0 0 0
2 144 0 0 0
3 144 0 0 0
4 144 0 0 0
2 1 72 46 26 0
2 72 59 13 0
3 95 49 0 0
4 118 26 0 0
3 1 52 46 20 26
2 52 56 10 26
3 75 56 0 13
4 98 46 0 0
Text(0.5, 1.0, 'Average Safety and Maintenance Price by Quality')
The "Average Safety and Maintenance Price by Quality" plot shows that as safety increases and maintenance price decreases, fewer cars are of unacceptable quality. All cars with a safety rating of 1 (low) evaluate to unacceptable quality, regardless of maintenance price. At a safety rating of 3 (high), the majority of cars evaluate to acceptable, good, or very good quality, and as the maintenance price goes down, more cars evaluate to good or very good.
# plot the correlation matrix using seaborn
import seaborn as sns
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.set(style="darkgrid") # one of the many styles to plot using
f, ax = plt.subplots(figsize=(9, 9))
sns.heatmap(df.corr(numeric_only=True), cmap=cmap, annot=True) # numeric_only skips the object-typed columns
f.tight_layout()
plt.title('Correlation Matrix Graph')
Text(0.5, 1.0, 'Correlation Matrix Graph')
According to the correlation matrix, boot space and safety are positively correlated with quality: the safer a car is and the more boot space it has, the higher its quality. Buying price and maintenance price are negatively correlated with quality: as the cost to buy or maintain a car goes up, its quality rating goes down.
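The heatmap colors can be double-checked numerically by sorting the quality column of the correlation matrix. A minimal sketch on a toy frame (the real values come from df; the toy data below is made up to show the pattern of signs):

```python
import pandas as pd

demo = pd.DataFrame({'safety': [1, 2, 3, 3],
                     'buying_price': [4, 3, 2, 1],
                     'quality': [1, 2, 3, 4]})
# correlation of every feature with the target, strongest positive first
corr_with_quality = demo.corr(numeric_only=True)['quality'].drop('quality')
print(corr_with_quality.sort_values(ascending=False))
# safety comes out positive and buying_price negative, matching the heatmap
```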
sns.set()
sns.pairplot(df, hue="quality", height=2)
<seaborn.axisgrid.PairGrid at 0x7f85a4752670>
The scatter plots show that buying price is negatively correlated with quality. This is unexpected, because higher quality cars usually cost more: a higher quality car typically has more safety features, a bigger boot space, and so on, and every added feature raises the purchase price. The expectation was therefore a positive correlation between buying price and quality. Since buying price is not a great predictor of the target "quality" on its own, the features will be transformed into two principal components using PCA.
from sklearn.decomposition import PCA
from sklearn import preprocessing
# create two data frames, separated into data and target
X = copy.deepcopy(df_reshape)
X.drop(columns=['quality'], axis=1, inplace=True)
X= preprocessing.scale(X)
X=pd.DataFrame(X)
y = df['quality'].copy()
y = y.values.tolist()
target_names = df_original.quality
feature_names = [i for j, i in enumerate(feature_names) if j not in (2,3,6)]
#print(feature_names)
#print(X.head())
#print(y)
#print(target_names.head())
pca = PCA(n_components=2)
pca.fit(X) # fit data and then transform it
X_pca = pca.transform(X)
#print the components
print('pca:', pca.components_)
pca: [[-0.          0.88465174  0.14744196 -0.44232587]
 [-0.         -0.38334909  0.77000945 -0.51002835]]
cmap = sns.set(style="darkgrid")
# this function definition just formats the weights into readable strings
# you can skip it without loss of generality to the Data Science content
def get_feature_names_from_weights(weights, names):
tmp_array = []
for comp in weights:
tmp_string = ''
for fidx,f in enumerate(names):
if fidx>0 and comp[fidx]>=0:
tmp_string+='+'
tmp_string += '%.2f*%s' % (comp[fidx],f)
tmp_array.append(tmp_string)
return tmp_array
plt.style.use('default')
# now let's get to the Data Analytics!
pca_weight_strings = get_feature_names_from_weights(pca.components_, feature_names)
# create some pandas dataframes from the transformed outputs
df_pca = pd.DataFrame(X_pca,columns=[pca_weight_strings])
from matplotlib.pyplot import scatter
print(pca_weight_strings[0])
print(pca_weight_strings[1])
# scatter plot the output, with the names created from the weights
plt.figure(figsize=(10, 10), dpi=80)
ax = scatter(X_pca[:,0], X_pca[:,1], c=y, cmap=cmap)
plt.xlabel(pca_weight_strings[0])
plt.ylabel(pca_weight_strings[1])
-0.00*buying_price+0.88*maintenance_price+0.15*boot_space-0.44*safety -0.00*buying_price-0.38*maintenance_price+0.77*boot_space-0.51*safety
Text(0, 0.5, '-0.00*buying_price-0.38*maintenance_price+0.77*boot_space-0.51*safety')
# adapted from a Sebastian Raschka example (your book!)
# also from his blog: http://sebastianraschka.com/Articles/2015_pca_in_3_steps.html
def plot_explained_variance(pca):
import plotly
from plotly.graph_objs import Bar, Line
from plotly.graph_objs import Scatter, Layout
from plotly.graph_objs.scatter import Marker
from plotly.graph_objs.layout import XAxis, YAxis
plotly.offline.init_notebook_mode() # run at the start of every notebook
explained_var = pca.explained_variance_ratio_
cum_var_exp = np.cumsum(explained_var)
plotly.offline.iplot({
"data": [Bar(y=explained_var, name='individual explained variance'),
Scatter(y=cum_var_exp, name='cumulative explained variance')
],
"layout": Layout(xaxis=XAxis(title='Principal components'), yaxis=YAxis(title='Explained variance ratio'))
})
pca = PCA(n_components=4)
X_pca = pca.fit(X)
plot_explained_variance(pca)
According to the scree plot, each principal component accounts for roughly 25% of the variation in the data. The scatter plot obtained from principal component analysis uses only two principal components, which together account for about 50% of the variation. To provide a good representation of the dataset, all four principal components would be needed.
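That roughly equal split is what explained_variance_ratio_ reports when the standardized features are nearly uncorrelated. A minimal sketch on synthetic data (the notebook's real ratios come from the pca object fitted above; the random data here only mimics the equal-variance situation):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 4))  # 4 independent, standardized features
pca_demo = PCA(n_components=4).fit(X_demo)
ratios = pca_demo.explained_variance_ratio_
print(np.round(ratios, 3))           # each ratio comes out near 0.25
print(np.round(ratios.cumsum(), 3))  # cumulative curve reaches 1.0 at PC4
```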
UCI Machine Learning Repository: Car Evaluation Data Set. https://archive.ics.uci.edu/ml/datasets/Car+Evaluation (accessed 2021-03-03).